AITopics | value function

Collaborating Authors

value function

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Bilevel Optimization over Saddle Points of Zero-Sum Markov Games

Zheng, Zihao, King, Irwin, Lu, Songtao

arXiv.org Machine LearningMay-27-2026

Reinforcement learning (RL) often has a hierarchical structure, where an upper-level (UL) learner selects model parameters and a lower-level (LL) decision-making process responds, naturally leading to a bilevel optimization problem. Most existing bilevel RL methods assume a single-policy LL Markov decision process (MDP), and therefore fail to capture competitive structures arising in applications such as incentive design, where multiple policies interact. We study bilevel optimization problems in which the LL problem is a regularized min-max zero-sum Markov game and the UL objective is optimized through the saddle-point equilibrium induced by the LL game. In this work, we propose penalty-augmented Nikaido-Isoda descent-ascent (PANDA), a penalty-based first-order policy-gradient method based on the Nikaido-Isoda function. By exploiting the min-max game structure, PANDA avoids computing UL hypergradients and does not require second-order information. We prove that PANDA converges to stationary points without convexity assumptions on either the UL or LL objectives. Moreover, PANDA reaches an $ε$-stationary point in $\tilde{\mathcal{O}}(ε^{-1})$ iterations with sample complexity $\tilde{\mathcal{O}}(ε^{-3})$, matching the best-known rates for bilevel RL with single-policy LL MDPs. Experiments demonstrate the superior performance of PANDA over closely related baselines.

artificial intelligence, machine learning, optimization problem, (14 more...)

arXiv.org Machine Learning

2605.26654

Country: Asia (0.27)

Genre: Research Report (0.81)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.67)
Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (0.67)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.46)

Add feedback

Learning Kernel-Based MDPs from Episodic Preferential Feedback

Pavlovic, Nikola, Vakili, Sattar, Zhao, Qing

arXiv.org Machine LearningMay-26-2026

Human feedback often arrives as preferences rather than calibrated numeric rewards, motivating reinforcement learning from preferential feedback, also referred to as reinforcement learning from human feedback (RLHF). We present a rigorous theoretical study of preference-only learning in episodic kernel MDPs. In each episode, the learner deploys two policies from a common start state and receives a single binary label indicating which trajectory is preferred, modeled by a Bradley--Terry--Luce link on the difference of cumulative (unobserved) rewards. Under kernel-based assumptions on the reward and transition functions (one of the most general models amenable to theoretical analysis) we develop preference-based value estimation and confidence sets tailored to end-of-episode comparisons. We prove high-probability regret bounds that scale sublinearly in the number of episodes, implying that the value of the learned policy converges to that of the optimal policy.

artificial intelligence, machine learning, reinforcement learning, (20 more...)

arXiv.org Machine Learning

2605.2365

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.54)

Add feedback

On the Sample Complexity of Discounted Reinforcement Learning with Optimized Certainty Equivalents

Mortensen, Oliver, Talebi, Mohammad Sadegh

arXiv.org Machine LearningMay-22-2026

We study risk-sensitive reinforcement learning in finite discounted MDPs, where a generative model of the MDP is assumed to be available. We consider a family or risk measures called the optimized certainty equivalent (OCE), which includes important risk measures such as entropic risk, CVaR, and mean-variance. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive OCE. We provide an exact characterization of utility functions $u$ for which the corresponding OCE defines an objective that is PAC-learnable. We analyze a simple model-based approach and derive PAC sample complexity bounds. We establish that whenever $u$ does not have full domain $\text{dom}(u)\neq \mathbb{R}$, the corresponding problem is not PAC-learnable. Finally, we establish corresponding lower bounds for both value and policy learning, demonstrating tightness in the size $SA$ of state-action space, and for a more restricted class of utilities, we derive lower bounds that makes the dependence on the effective horizon $\frac{1}{1-γ}$ explicit. Specifically, for $\text{CVaR}_τ$ we show that the correct dependence on $τ$ is $\frac{1}{τ^2}$, thus improving by a factor of $\frac{1}τ$ over state-of-the-art although our bound has a suboptimal dependence on $\frac{1}{1-γ}$.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

arXiv.org Machine Learning

2605.21763

Country: Europe (0.46)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)

Add feedback

Generative Modeling by Value-Driven Transport

Moreno-Muñoz, Pablo, Müller, Adrian, Neu, Gergely

arXiv.org Machine LearningMay-22-2026

We propose a new framework for generative modeling based on a discrete-time stochastic control formulation of measure transport. Adapting classic results from control theory, we formulate our problem as a linear program whose dual variables correspond to the \emph{optimal value function} of the control problem, which directly encodes the optimal control policy. Exploiting this LP formulation, we develop an efficient simulation-free primal-dual algorithm for computing approximately optimal value functions and the associated \emph{value-driven transport} (VDT) policies which approximate the true optimal policy. We show that well-trained VDT policies enjoy numerous favorable properties in comparison with other state-of-the-art methods based on flows, diffusions, or Schrödinger bridges: they lead to straight transport paths which can be simulated quickly and robustly, and can be enhanced in all the same ways as diffusion and flow-based models (e.g., conditional generation, classifier-free guidance, unpaired data-to-data translation are all easy to incorporate). We evaluate our methodology in a range of experiments, with results that indicate strong performance and good potential for scalability.

artificial intelligence, machine learning, src, (18 more...)

arXiv.org Machine Learning

2605.22507

Country:

Europe (0.92)
North America > United States (0.67)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Mathematical & Statistical Methods (0.66)

Add feedback

Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

Gong, Shijin, Ye, Kai, Zhu, Jin, Zhang, Xinyu, Zhou, Hongyi, Shi, Chengchun

arXiv.org Machine LearningMay-19-2026

Recent advances in large language models (LLMs) have increasingly relied on reinforcement learning (RL) to improve their reasoning capabilities. Three types of approaches have been widely adopted: The first relies on a deep neural network to estimate the value function of the learning policy in order to reduce the variance of the policy gradient. However, estimating and maintaining such a value network incurs substantial computational and memory overhead. The second avoids training a value network by approximating the value function using sample averages. However, it samples a large number of reasoning traces per prompt for accurate value function approximation, making it computationally expensive. The third samples only a single reasoning trajectory per prompt, which reduces computational cost but suffers from poor sample efficiency. This paper focuses on a practical, resource-constrained setting in which only a small number of reasoning traces can be sampled per prompt, while low-variance gradient estimation remains essential for high-quality policy learning. To address this challenge, we bring classical nonparametric statistical methods, which are both computationally and statistically efficient, to LLM reasoning. We employ kernel smoothing as a concrete example for value function estimation and the subsequent policy optimization. Numerical and theoretical results demonstrate that our proposal achieves accurate value and gradient estimation, leading to improved policy optimization.

large language model, machine learning, reinforcement learning, (19 more...)

arXiv.org Machine Learning

2604.28005

Genre: Research Report > New Finding (0.87)

Technology:

Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (0.89)

Add feedback

Reinforcement Learning Measurement Model

Xu, Wenqian, Ji, Feng

arXiv.org Machine LearningMay-12-2026

Interactive assessments generate sequential process data that are not well handled by conventional item response models. Existing MDP-based measurement approaches, such as the Markov decision process measurement model (MDP-MM, LaMar, 2018), link action choices to state-action values, but their reliance on person-specific tabular value functions makes them difficult to scale beyond small, fully enumerated tasks. We propose the Reinforcement Learning Measurement Model (RLMM), a measurement framework that decouples person-level choice sensitivity from task-level value representation through a shared parametric action-value function, making estimation more computationally efficient for larger process-data settings. The model combines a Boltzmann choice rule with normalized advantages, a soft Bellman consistency penalty, and a block-coordinate MAP procedure for joint estimation, while also yielding step-level influence diagnostics for identifying behaviorally critical decisions. In peg-solitaire simulations, the RLMM achieved higher estimation accuracy and substantially lower runtime than the original MDP-MM, with advantages increasing as task complexity grew. In AQUALAB gameplay logs, the estimated person parameter was positively associated with cumulative reward, task completion, and behavioral efficiency. These results show that the RLMM extends decision-process-based psychometric models to larger and more behaviorally realistic environments while preserving an interpretable latent trait tied to decision making steps.

artificial intelligence, machine learning, measurement model, (17 more...)

arXiv.org Machine Learning

2605.09305

Country: North America > Canada (0.28)

Genre: Research Report (0.70)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (1.00)

Add feedback

Improved Model-based Reinforcement Learning with Smooth Kernels

Long, Kun, Li, Yuqiang, Wu, Xianyi

arXiv.org Machine LearningMay-11-2026

For continuous state-action space scenarios, classical reinforcement learning (RL) theory predominantly focuses on low-rank Markov decision processes (MDPs), which provide sample-efficient guarantees at the expense of restrictive structural assumptions. Kernel smoothing model-based approaches offer a promising alternative paradigm that instead leverages the smoothness of the MDP and employs non-parametric kernel smoothing estimates of transition dynamics. This paper proposes a new kernel-smoothing model-based approach for online reinforcement learning in finite-horizon settings under Lipschitz continuity assumptions on the MDP. By incorporating a Bernstein-style exploration bonus into the kernel smoothing framework, our method achieves a regret bound which improves upon the state-of-the-art regret bound in its dependence on the horizon. The theoretical advancement relies on a delicate analysis of the synergy between Bernstein-style bonuses and kernel smoothing, where a new tight Bernstein-type concentration inequality for martingales may be of independent interest.

artificial intelligence, machine learning, reinforcement learning, (13 more...)

arXiv.org Machine Learning

2605.07218

Country: Asia (0.28)

Genre: Research Report (0.64)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Undirected Networks > Markov Models (0.48)

Add feedback

SOC-ICNN: From Polyhedral to Conic Geometry for Learning Convex Surrogate Functions

Liu, Kang, Hu, Jianchen, Peng, Wei

arXiv.org Machine LearningMay-6-2026

Classical ReLU-based Input Convex Neural Networks (ICNNs) are equivalent to the optimal value functions of Linear Programming (LP). This intrinsic structural equivalence restricts their representational capacity to piecewise-linear polyhedral functions. To overcome this representational bottleneck, we propose the SOC-ICNN, an architecture that generalizes the underlying optimization class from LP to Second-Order Cone Programming (SOCP). By explicitly injecting positive semi-definite curvature and Euclidean norm-based conic primitives, our formulation introduces native smooth curvature into the representation while preserving a rigorous optimization-theoretic interpretation. We formally prove that SOC-ICNNs strictly expand the representational space of ReLU-ICNNs without increasing the asymptotic order of forward-pass complexity. Extensive experiments demonstrate that SOC-ICNN substantially improves function approximation, while delivering competitive downstream decision quality. The code is available at https://anonymous.4open.science/r/SOC-ICNN-4B18/.

artificial intelligence, machine learning, optimization problem, (20 more...)

arXiv.org Machine Learning

2604.22355

Country: North America > United States (0.28)

Genre: Research Report (0.63)

Industry: Energy > Power Industry (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.87)

Add feedback

Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds

Neural Information Processing SystemsMay-1-2026, 04:35:07 GMT

While numerous works have focused on devising efficient algorithms for reinforcement learning (RL) with uniformly bounded rewards, it remains an open question whether sample or time-efficient algorithms for RL with large state-action space exist when the rewards are heavy-tailed, i.e., with only finite (1+ϵ)-th moments for some ϵ (0,1]. In this work, we address the challenge of such rewards in RL with linear function approximation.

artificial intelligence, machine learning, reinforcement learning, (14 more...)

Neural Information Processing Systems

Country: North America > United States > California (0.28)

Technology: